Familiarise yourself with RStudio and R Notebooks, which is what we’ll use to interact with R.
Learn about the simple data structures in R: object, vector, and data frame.
Explore R’s basic data types: integer, character, numeric.
Learn where to go for help.
Extra time:
Learn to read data into R.
Create simple visualisations using ggplot2.
R is a versatile coding language for data science, with a wonderful community supporting it. Here’s a short list of some of the things that make R great.
Free and open source: R is a free and open source programming language and environment for statistical computing, data science, and data visualization.
Reproducibility and Reporting: Writing reproducible reports is now easier than ever thanks to packages like knitr and R Markdown. R, beyond being a computer language, is a flexible system that allows the entire data analysis process, from data collection to reporting, to be carried out without leaving the same environment provided by RStudio. This makes analysis less error-prone and facilitates external scrutiny.
RStudio: RStudio is a powerful integrated development environment (IDE) that has made learning and using R much easier, with options for workflow and project management.
Data Visualization: R can be used to make great data graphics, with packages like ggplot2 helping users make graphics in an intuitive way.
R Packages and Community: With over 18,000 packages on CRAN alone, there’s pretty much a package to do anything. The greater R community has also expanded tremendously over time, bringing in new users and pushing R to be useful in more applications. Each year there are thousands of meetups, conferences, seminars, and workshops on R all around the world. Check out the useR! conference (https://user2022.r-project.org).
R and RStudio are often confused, but they serve different purposes. We can compare them to a car and its parts: R is like a car’s engine, while RStudio is like a car’s dashboard.
RStudio is an integrated development environment (IDE) for R and is structured into various panels/sections for ease of use. The borders of these can be dragged to expand or contract individual panes, and the panes can be arranged as you prefer by clicking Tools > Global Options > Pane Layout in the menu. Some can also be detached from the main window to be made fullscreen.
Source: R-Ladies Sydney Basic Basics.
There are many other features. See https://www.rstudio.com/products/rstudio/.
First let’s set it so that our notebook shows up in our viewer.
Some people like RStudio to remember objects and settings from session to session. However, this can be dangerous, as previous work and loaded packages can interfere with current code and make your code more fragile. To avoid this, it is recommended that you change two settings in RStudio.
Locate Preferences (On Windows, this is in the Tools->Global Options menu; on a Mac, this is in the RStudio menu). In the General tab, uncheck “Restore .RData…” and select “never” for “Save workspace…”
Source: R-Ladies Sydney Basic Basics.
R Notebooks give the opportunity to combine code and description in a single human-readable notebook. You can conduct analysis and give interpretation side-by-side! We will focus on using RStudio to produce R Notebooks, a particular type of R Markdown document.
An R Markdown notebook has the following characteristics:
Reproducible workflow by default.
Code, output and narrative in a single document.
Output to multiple formats (html, PDF, Word).
Output inline with code: When you execute code within the notebook, the results appear beneath the code.
It is possible to share output and code in a single file.
The notebook source code is a .Rmd file
Notebooks have an associated HTML file (.nb.html). When you save the .Rmd, an HTML file containing the code and output is saved alongside it automatically (click the Preview button or press Ctrl+Shift+K, Cmd+Shift+K on macOS, to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Unlike Knit, Preview does not run any R code chunks; instead, it displays the output from when each chunk was last run in the editor.
When a .nb.html file is opened in RStudio it will automatically open a .Rmd file with the source code.
We will enter R code into blocks called “chunks”. Chunks can be executed independently and interactively, with output visible immediately beneath the input.
Notebook chunks are inserted using the keyboard shortcut Ctrl + Alt + I (macOS: Cmd + Option + I), or via the Insert menu in the editor toolbar.
A chunk can also be inserted manually by typing ```{r} on its own line, typing the code on the following lines, and finally closing the chunk by typing ``` on a separate line.
#add comment
print('code goes here!')
## [1] "code goes here!"
Note: you can write comments inside a chunk. Comments are always preceded by #, and you can comment or uncomment code using Ctrl + Shift + C. Comments can be notes for yourself, or an explanation of what the code is doing for someone else to follow. We can run the code chunk using the play button on the right.
So far, we have explored R and how it works together with RStudio. However, R is much more than that! One of the things that makes R amazing is the open source community surrounding it.
The R community, which is made up of academics, software engineers, statisticians, social and political scientists, economists, and data scientists, to name a few, is responsible for authoring a wide variety of packages (>18,000) that can do a wide range of data manipulation, visualisation, and analysis tasks.
To get your head around what CRAN, library, packages, and functions are I find it helpful to think of books.
CRAN stands for the Comprehensive R Archive Network. It’s like the R equivalent of the National Library or National Museum Library. It holds a copy of every package (book) and all the versions of R.
On your computer you’ll have a local library with copies of the packages you’ve installed from CRAN (your home office book shelf).
Click on the ‘Packages’ tab in the lower right hand panel (Q4 from before). You can see what packages are in your library, a short description of what they do, and the package version.
The packages that are loaded have a check mark in the box on the left. Several packages are loaded automatically each time you start an R session, e.g. the base package.
Although it is possible to load and install your packages from here, I recommend using the functions shown below instead. This way, someone else or future you knows exactly what packages they need to run the analyses.
You should load the packages you will use at the top of your script, so that future you or your colleague knows what needs to be installed/loaded.
A package needs to be installed only once and requires an internet connection which allows your computer to communicate with the CRAN server.
You may wish to install a package with the additional argument dependencies = TRUE, which will also install any packages that the package depends on.
On your personal computer, you can install a package to your local library from CRAN by uncommenting and running the following:
# install.packages("tidyverse")
# install.packages("ggplot2", dependencies = TRUE)
However, if you are on a government laptop without elevated access rights, you might need to check with your IT department about the best way to handle installations.
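Since a package only needs installing once, it can be handy to check whether it is already in your library before installing. The sketch below is one way to do that with base R’s requireNamespace(); it uses the stats package purely as a stand-in because it ships with R, and the variable names are illustrative.

```r
# Check whether a package is already in the local library.
# "stats" ships with R, so it is used here only as a safe stand-in.
pkg <- "stats"
is_installed <- requireNamespace(pkg, quietly = TRUE)
is_installed

# Install only when the package is missing (left commented out on purpose).
# if (!is_installed) install.packages(pkg)
```

Swap "stats" for the package you actually need, e.g. "tidyverse".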
In order to use the package you need to load it to your workspace. This needs to be done each time you start a new RStudio session or project. Think of it as taking the book you will use off your book shelf to place next to you on the desk.
In this case, tidyverse is a meta-package, which actually contains several individual packages including dplyr, forcats, etc., but more on those later.
The tidyverse meta-package is in your library already, so we can simply make a call to load it.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Alas, there are not enough names to make each function in every
package unique. The “Conflicts” line that is printed tells us that the
dplyr function filter will mask the
stats package function filter.
If we want to be completely accurate, we can specify the package and
function using the following form
<package_name>::<function_name>,
e.g. dplyr::filter().
If we follow the recipe book analogy, this is like saying we want the
lasagna recipe from jamie_oliver::lasagna so that it isn’t
confused with the nigella_lawson::lasagna recipe.
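As a small illustration of the package::function form, the sketch below calls filter() from the stats package explicitly, so it does not matter whether dplyr has masked it. The input vector is made up for the example; stats::filter() here computes a 3-point moving average.

```r
x <- c(1, 2, 3, 4, 5)

# Explicitly ask for filter() from the stats package (a moving-average filter),
# regardless of whether dplyr::filter() is masking it.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# Convert the time-series result to a plain numeric vector.
# The ends are NA because a centred 3-point average needs a neighbour on each side.
as.numeric(moving_avg)
```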
You can think of a package like a book on a particular subject. Each package is designed to do a specific set of tasks (e.g. data manipulation, implement linear models, draw geographical maps, etc.). Each task is implemented using a function, which is a set of statements organised to complete the task.
A function is like a recipe from a book. It is designed to make one specific thing, e.g. cupcakes or steak and kidney pie. The function takes arguments (e.g. ingredients) and then carries out a series of steps where the ingredients are modified, cooked, combined, etc. to create the final recipe.
Some of these arguments will be optional (e.g. add or don’t add cinnamon), whereas other arguments will be required for the function to run (e.g. you can’t make the cake without flour!).
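To make the difference between required and optional arguments concrete, here is a minimal sketch using a made-up recipe function. make_cake(), flour_g, and cinnamon are hypothetical names invented for this illustration; they are not part of R or any package.

```r
# Hypothetical recipe function: `flour_g` is required (no default),
# `cinnamon` is optional (defaults to FALSE).
make_cake <- function(flour_g, cinnamon = FALSE) {
  base <- paste0("cake made with ", flour_g, "g of flour")
  if (cinnamon) paste(base, "and cinnamon") else base
}

make_cake(flour_g = 200)                   # optional argument left at its default
make_cake(flour_g = 200, cinnamon = TRUE)  # optional argument supplied

# Leaving out the required argument is an error; tryCatch() captures it here
# so the rest of the code keeps running.
result <- tryCatch(make_cake(), error = function(e) "error: flour_g is missing")
result
```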
Functions follow the form:
functionName(argument1 = value1, argument2 = value2, and so on)
Let’s take a look at some of the built-in functions R has for carrying out basic statistics/analysis, starting with seq().
The seq() function
Let’s try using seq(), which makes regular sequences of numbers and, while we’re at it, demo more helpful features of RStudio.
Type se and hit TAB. A pop up shows you possible
completions.
Select seq() by typing more of the function name or by using the up/down arrows. Notice the floating help box that pops up to remind you of the function’s arguments.
If you want even more help, press F1 as directed to get the full
documentation in the help tab of the lower right pane. You can also
access the help file for a function by typing
?seq.
Now open the parentheses and notice the automatic addition of the closing parenthesis and the placement of cursor in the middle. Type the arguments 1, 10 and hit return. RStudio also exits the parenthetical expression for you.
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
Let’s take a closer look at the help file for seq().
?seq
Every help file will have a series of sections describing what the function does. I generally focus first on: Description, Usage, Arguments, and Examples.
For example, in the helpfile for seq() under
Description, it tells us it is a function to “Generate
regular sequences”.
We can see that seq() takes the arguments
from, to, and by, and the
optional arguments length.out and
along.with.
Here, we can find out what these arguments are:
from, to: the starting and maximal end values of the sequence.
by: number, the increment of the sequence.
In the code we used above, we generated a sequence of numbers from 1 to 10. In this case we did not supply a value for by, so it took the default value, which in this case is 1.
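To see what the other arguments do, here is a short sketch that supplies by and length.out explicitly. The particular values are arbitrary and chosen only for illustration.

```r
# Step through 1..10 in increments of 2
odd_numbers <- seq(from = 1, to = 10, by = 2)
odd_numbers   # 1 3 5 7 9

# Ask for exactly 5 evenly spaced values between 0 and 1
five_points <- seq(from = 0, to = 1, length.out = 5)
five_points   # 0.00 0.25 0.50 0.75 1.00

# A negative increment counts down
countdown <- seq(from = 10, to = 1, by = -3)
countdown     # 10 7 4 1
```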
Operators in R:
| Operator | Description |
|---|---|
| + | Addition |
| - | Subtraction |
| * | Multiplication |
| / | Division |
| ^ or ** | Exponents |
| %% | Modulo Division |
| %/% | Floor Division |
Let’s have a play with some of these operators:
9 + 27.73823
## [1] 36.73823
9 - 99.232131300000001
## [1] -90.23213
18^3 - 18**3
## [1] 0
13%%5
## [1] 3
13%/%5
## [1] 2
It’s also useful to note that R will follow BIDMAS for order of operations.
10 + 11 * 12 / 3 - 5**2
## [1] 29
10 + 11 * 14 %% 3 - 5**2
## [1] 7
10 + 11 * 14 %/% 3 - 5**2
## [1] 29
As you can see above, %% and %/% are not simply treated like ordinary division: R evaluates them before * and /, so in the second example 14 %% 3 is worked out first and the result is then multiplied by 11.
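When in doubt about the order of operations, parentheses make the grouping explicit. The sketch below compares a few groupings; the results in the comments follow from R’s documented operator precedence, in which %% and %/% bind more tightly than * and /.

```r
# Without parentheses, %% is evaluated before *
11 * 14 %% 3      # 22, i.e. 11 * (14 %% 3)

# Parentheses change the grouping
(11 * 14) %% 3    # 1
11 * (14 %% 3)    # 22

# The familiar case: multiplication before addition
2 + 3 * 4         # 14
(2 + 3) * 4       # 20
```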
There are also some built-in functions such as abs(-5),
sqrt(9) or round(3.6357, digits = 2). Let’s
see what they do:
abs(-5)
## [1] 5
sqrt(9)
## [1] 3
round(3.6357, digits = 2)
## [1] 3.64
There are many more built-in functions which you will discover with more exposure to R, but for now we will move onto the different types of numbers we might work with. These are: Numeric, Double, Integers and Complex numbers.
The key is being able to recognise, or find out, which type of number we are working with.
Integers are whole numbers with no decimal place, e.g. …, -2, -1, 0, 1, 2, …
Double is the general number type: by default R will treat all numbers you use as double unless you give it an explicit reason to think otherwise. Therefore, any number, with or without a decimal place, will be treated as double unless you specify otherwise.
typeof(10.53564)
## [1] "double"
typeof(10)
## [1] "double"
Confusingly enough, under the hood, Double and Numeric are identical and just have different names. From now on we shall only refer to double as the data type and leave numeric to its dictionary definition to avoid confusion.
typeof(10)
## [1] "double"
We can work with integers, if we wish, by appending an ‘L’ to the end of the number.
typeof(10L)
## [1] "integer"
As the next example shows, mixing number types doesn’t affect our ability to use operators: R converts the integer to a double before multiplying. (Curious why? Ask me!)
typeof(10*10L)
## [1] "double"
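A few more combinations may help show when R keeps an integer and when it switches to double. This is a small sketch using only typeof() and is.integer() from base R.

```r
typeof(10L * 10L)  # "integer": both operands are integers
typeof(10L * 10)   # "double": the integer is converted to double first
typeof(10L / 2L)   # "double": division always returns a double
is.integer(10L)    # TRUE
is.integer(10)     # FALSE: a plain 10 is stored as a double
```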
In R, we refer to text as ‘character’ strings.
Character strings are enclosed in single quotes '' or double quotes "".
"This is a string"
## [1] "This is a string"
A few symbols are awkward to get into character strings because they have an underlying meaning elsewhere. Generally speaking, a backslash before the special character you require will solve the problem. If not, Google will definitely have the answer!
For example, if you wanted to include quotation marks inside your string:
writeLines("Then he told Skyler \"I am the one who knocks!\"")
## Then he told Skyler "I am the one who knocks!"
We can do a lot more with character strings in base R, such as:
substr - creates a substring of a given string,
nchar - returns the number of characters in your string,
sprintf - formats a given string,
strsplit - splits a given string by a given substring or character.
And many more. Experiment with these.
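As a starting point for experimenting, here is a short sketch of the four functions listed above. The example strings are made up for illustration.

```r
place <- "Namibia"

# substr(): characters 1 to 3 of the string
first_three <- substr(place, start = 1, stop = 3)
first_three   # "Nam"

# nchar(): number of characters
n <- nchar(place)
n             # 7

# sprintf(): fill placeholders (%s = string, %d = integer) in a template
sentence <- sprintf("%s has %d characters", place, n)
sentence      # "Namibia has 7 characters"

# strsplit(): split on a separator; it returns a list, so take element [[1]]
parts <- strsplit("WHK,Khomas,Namibia", split = ",")[[1]]
parts         # "WHK" "Khomas" "Namibia"
```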
Logical values in R are written as TRUE or FALSE.
They are special R data types, not strings! So they don’t need '' or "" around them.
They also have a numerical value behind them: TRUE is 1 and FALSE is 0.
Therefore, we can treat them as numbers!
TRUE + TRUE
## [1] 2
And use all the other operators too!
TRUE * 2.5
## [1] 2.5
It’s uncommon to use logical values for simple calculations like above. We are far more likely to deal with logical values when we work with logical (or conditional) operators. i.e. To perform comparisons and checks.
| Logical Operator | Description |
|---|---|
| < | Less Than |
| <= | Less Than or Equal To |
| > | Greater Than |
| >= | Greater Than or Equal To |
| == | Equal To |
| != | Not Equal To |
| \| | Or |
| & | And |
| %in% | Membership |
(Ignore backslashes in the notebook.) Let’s see how these work:
Is 4 greater than 5?
4 > 5
Is 25 equal to 5 squared?
25 == 5**2
Is 1 not equal to 2?
1 != 2
That covers the simpler logical operators. The final two allow us to combine and check multiple conditions.
Both Or (|) and And
(&) have to have a condition on either side of them.
They will check the conditions on each side and then return
TRUE or FALSE based on their value.
And (&) will return TRUE
if both conditions are TRUE, FALSE
otherwise.
Or (|) will return TRUE if at least one of the conditions is TRUE, FALSE otherwise.
Let’s look at some examples:
Can a statement be both TRUE and FALSE?
TRUE & FALSE
TRUE | FALSE
Or a less abstract equivalent example:
(25 == 5**2) & (4 > 5)
(25 == 5**2) | (4 > 5)
(25 == 5**2) & TRUE
Or more interestingly:
(25 == 5**2) & 5
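For reference, here is a sketch of what the comparisons above evaluate to, plus an example of %in%, which appears in the table but has not been demonstrated yet. The vector of place names is made up for illustration.

```r
4 > 5                    # FALSE
25 == 5**2               # TRUE
1 != 2                   # TRUE

TRUE & FALSE             # FALSE: And needs both sides to be TRUE
TRUE | FALSE             # TRUE: Or needs at least one side to be TRUE

(25 == 5**2) & (4 > 5)   # FALSE
(25 == 5**2) | (4 > 5)   # TRUE

# A non-zero number is converted to TRUE when used with & or |
(25 == 5**2) & 5         # TRUE

# %in% checks membership in a vector
"WHK" %in% c("WHK", "Khomas", "Namibia")    # TRUE
"Tuli" %in% c("WHK", "Khomas", "Namibia")   # FALSE
```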
Given that R is a dynamically typed language, sometimes the data type you want to work with, and the one R infers are not the same.
R doesn’t require you to set the data type when you create it, instead it figures out what the best data type is for the object you are creating - numeric, character, logical, etc.
Sometimes you want to ensure that a particular object is actually a certain type, rather than leaving it up to R. This is done using data type conversion (or casting). R has a range of in-built functions that enable you to convert data from one type to another.
Firstly, the typeof() function returns the (R internal)
type of any given object. e.g.
typeof("is this a character?")
## [1] "character"
typeof(5L)
## [1] "integer"
typeof(5)
## [1] "double"
We could equivalently use the is. family of functions, which check whether our object is of a given type, e.g. (set the cursor after is. and hit TAB)
is.
is.integer(5)
## [1] FALSE
The functions that convert from one data type to another form the as. family. Let’s see what we can do…
as.
as.integer(4.996453)
## [1] 4
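A few more conversions are sketched below to show how the as. functions behave. Note in particular that as.integer() truncates rather than rounds, and that a string that cannot be read as a number becomes NA (with a warning, suppressed here).

```r
as.integer(4.996453)    # 4: truncates toward zero, it does not round
as.integer(-4.996453)   # -4
round(4.996453)         # 5: use round() if rounding is what you want

as.character(5)         # "5"
as.numeric("3.14")      # 3.14
as.logical("TRUE")      # TRUE
as.logical(0)           # FALSE

# A string that is not a number cannot be converted and becomes NA
not_a_number <- suppressWarnings(as.numeric("five"))
not_a_number            # NA
```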
You can name your objects almost anything using letters, numbers, periods and underscores. A name cannot start with a number or an underscore (or with a dot followed by a number), and it cannot contain other characters such as a comma or a space.
this_works <- 10*5
this_works
## [1] 50
Try running the following lines of code. Try uncommenting the code
# this_doesn't_work <- 10*5 by clicking on the line and
using Ctrl + Shift + C.
# this_doesn't_work <- 10*5
It is useful for future you and your collaborators to name your objects something that is reasonable and describes what the object contains. To make your object names easy to read it is useful to adopt a convention for demarcating words in names.
the_total_number_deaths_after_covid
some.people.use.periods
othersUseCamelCase
Make a new object
a_very_long_name <- 7^2
Sometimes, to make our object names readable, we use long names that are tedious to type. Luckily, RStudio has a handy completion facility.
Start by typing the first few letters of a_very... in the code chunk below and press TAB to complete the name.
a_very_long_name
## [1] 49
Let’s try inspecting the object again.
# What happens if you run:
a_vry_long_name
A_very_long_name
R is very sensitive to both case and spelling mistakes and won’t run unless things are spelled correctly and are in the right case. If you get an error, check your spelling! More than 80% of the time, this is likely the cause of your error!
A vector is a 1-dimensional ordered collection of
elements, all of the same type. It is the fundamental
data structure in R with a lot of useful properties.
We can extract an element from a vector by referencing its position.
Let’s make a new vector called character_vector using the
function c() which can be used to
c()ombine elements.
Since R insists all elements in a vector have to be of the same type we then can have 5 types of vectors. These are:
There are more types of vectors, but for the purpose of our learning these 5 are sufficient.
Creating vectors with c()
In R, to create a vector we have to call a function that will return a vector object. The most commonly used one is the c() function, where the ‘c’ stands for ‘combine’.
The c() function takes several arguments, coerces them into a common type, and wraps them in a vector, e.g.
a_vec <- c(1,2,3,4,4,5)
a_vec
## [1] 1 2 3 4 4 5
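The coercion to a common type is easiest to see when the inputs are mixed. The sketch below is a small illustration; the values are arbitrary.

```r
# Integer + double -> everything becomes double
typeof(c(1L, 2.5))          # "double"

# Number + text -> everything becomes character
typeof(c(1, "a"))           # "character"
mixed <- c(1, "a", TRUE)
mixed                       # "1" "a" "TRUE"

# Number + logical -> logicals become 1 and 0
num_and_lgl <- c(1, TRUE, FALSE)
num_and_lgl                 # 1 1 0
```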
## Defining the character vector:
character_vector <- c("WHK", "Khomas", "Namibia", "WHK", "Khomas", "Namibia")
Notice that when we specify words or characters, we use
"".
Inspect the structure with str():
str(character_vector)
## chr [1:6] "WHK" "Khomas" "Namibia" "WHK" "Khomas" "Namibia"
Thanks to the "" around our text, R is able to recognise that the vector contains character strings (chr).
Find the number of elements with length():
length(character_vector)
## [1] 6
Extract a range of elements with square brackets and the : operator:
character_vector[3:5]
## [1] "Namibia" "WHK" "Khomas"
Replace an element using <-. Try replacing the 4th element with your name:
character_vector[4] <- "Tuli"
character_vector
## [1] "WHK" "Khomas" "Namibia" "Tuli" "Khomas" "Namibia"
The same method used to extract information works for any type of
vector. Here we can define a new vector numeric_vector
containing the numbers 1, 2, 3, 4, and 5.
numeric_vector <- c(1:5)
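Check the structure with str():
str(numeric_vector)
## int [1:5] 1 2 3 4 5
Because we created the vector with the : operator and whole numbers, R classifies it as integer (int); had we typed c(1, 2, 3, 4, 5) instead, it would have been classified as numeric (num).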
numeric_vector[1:2]
## [1] 1 2
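Selecting several elements with c()
Try uncommenting and running the line below:
# numeric_vector[1,3]
Note that we can only select elements that are not next to each other, such as the 1st and 3rd, or the 1st, 3rd, and 4th, by wrapping their positions in c().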
numeric_vector[c(1,3:4)]
## [1] 1 3 4
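To see why the comma version fails, the sketch below captures the error message instead of stopping. The vector is redefined here so the example runs on its own, and the exact wording of the message assumes an English locale.

```r
numeric_vector <- c(1:5)

# A vector has only one dimension, so [1, 3] (row 1, column 3) is an error.
# tryCatch() captures the message so the code keeps running.
msg <- tryCatch(numeric_vector[1, 3], error = function(e) conditionMessage(e))
msg                        # "incorrect number of dimensions"

# Wrapping the positions in c() passes them as a single index vector
picked <- numeric_vector[c(1, 3)]
picked                     # 1 3
```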
Let’s make a second numeric vector.
numeric_vector2 <- c(1.1,3:4)
## Check the structure
str(numeric_vector2)
## num [1:3] 1.1 3 4
You’ll notice that now when we check the structure, the vector is numeric (num). This is because we now have a number with a decimal place.
R is what is known in computer science as a dynamically typed language. R doesn’t require you to set the data type when you create a vector, instead it figures out what the best data type is for the object you are creating - numeric, character, factor, logical, etc.
However, sometimes the data type you want to work with, and the one R infers are not the same. You can change the data type using a range of in-built functions that enable you to convert data from one type to another.
The as. functions
A useful set of functions are the as. functions, which take the form as.<structure>. We can use this to specify the structure of our numeric vector to be numeric.
numeric_vector <- as.numeric(numeric_vector)
str(numeric_vector)
## num [1:5] 1 2 3 4 5
The structure of vectors becomes important when we use it to analyse different things.
character_vector <- as.factor(character_vector)
str(character_vector)
## Factor w/ 4 levels "Khomas","Namibia",..: 4 1 2 3 1 2
Note that character_vector is now classed as a factor (Factor) with 4 levels: “Khomas”, “Namibia”, “Tuli”, and “WHK”.
When you create a factor, R uses an integer code to represent each level, so that “Khomas” is both “Khomas” and 1, and “Namibia” is both “Namibia” and 2. You’ll notice that it automatically takes the alphabetical order when determining the factor levels. This means that even though “WHK” occurs 1st in our character vector, it gets the integer code 4. This is just a detail now, but becomes important in plotting, especially if you want to change the order in which your factors are plotted.
Factors are especially useful if we want to group data by a factor (e.g. country) for counting or summarising. For instance, “Khomas” and “Namibia” each occur twice, whereas “WHK” and “Tuli” each occur only once.
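The sketch below shows how to look at the levels, the integer codes, and the counts of a factor, and one way to set the level order yourself. The vector is re-created here so the example runs on its own; places and reordered are names chosen for this illustration.

```r
places <- factor(c("WHK", "Khomas", "Namibia", "Tuli", "Khomas", "Namibia"))

levels(places)        # "Khomas" "Namibia" "Tuli" "WHK" (alphabetical)
as.integer(places)    # 4 1 2 3 1 2 (the integer code behind each element)
table(places)         # how often each level occurs

# Set the level order explicitly instead of relying on alphabetical order
reordered <- factor(places, levels = c("WHK", "Khomas", "Namibia", "Tuli"))
as.integer(reordered) # 1 2 3 4 2 3
```

Setting the levels explicitly is what lets you control the order in which groups appear in plots later on.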
Vectors aren’t just containers for homogeneous data. R is a vectorised language, which means operations are applied to each element of the vector automatically, without the need to loop through the vector.
This is powerful because, at a low level, computer chips are generally optimised for this type of calculation (SIMD: single instruction, multiple data).
Let’s look at some examples
numeric_vector
## [1] 1 2 3 4 5
numeric_vector*3
## [1] 3 6 9 12 15
numeric_vector^2
## [1] 1 4 9 16 25
You can also multiply, divide, add, and subtract vectors of the same length.
x <- seq(from = 1, to = 20, by = 4)
x
## [1] 1 5 9 13 17
numeric_vector/x
## [1] 1.0000000 0.4000000 0.3333333 0.3076923 0.2941176
What happens when you run the following line of code?
x - numeric_vector
## [1] 0 3 6 9 12
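To show what “without the need to loop” means in practice, the sketch below writes the same calculation both ways. The vector is redefined here so the example runs on its own; tripled_loop and tripled_vec are names chosen for this illustration.

```r
numeric_vector <- c(1, 2, 3, 4, 5)

# The explicit loop: visit each position and multiply it by 3
tripled_loop <- numeric(length(numeric_vector))
for (i in seq_along(numeric_vector)) {
  tripled_loop[i] <- numeric_vector[i] * 3
}

# The vectorised version: one expression, applied to every element
tripled_vec <- numeric_vector * 3

# Both approaches give the same result
identical(tripled_loop, tripled_vec)   # TRUE
```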
Fill in the code chunks to answer the following questions
numeric_vector[]
Hint: you can use length() to find out how many elements
there are in the character vector.
character_vector[]
numeric_vector
y <- c(5:1)
w <- c(1:4)
numeric_vector/w
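Help files and vignettes: You can look up the help file for a function, or the vignette for a package, using ? or vignette respectively.
?filter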
vignette("dplyr")
CRAN Task View: Looking for a package to carry out a particular analysis? Check out the CRAN Task Views.
Stack Overflow: Check out Stack Overflow. This is one of the first ports of call, where members of the R community will help you answer questions.
Cheatsheets: Many of the tidyverse packages come with their own cheatsheets, which are a quick reference on how to use various functions. They also give a good overview of what functions are available.
Google: Google is your friend! Type “R help” followed by the warning or error message you received and I guarantee there will be someone who has had this problem before.
Meet ups and coding clubs: Join a meet up or coffee and code group. Check out R-Ladies.
Further resources: Looking to develop your learning further? Check out my Trello board on R Resources for Data Science. This is still a work in progress, but I’m continually updating it with useful resources.
seq() as an example. Stat 545, University of British Columbia, blog by Jenny Bryan.
These tutorials were adapted from training materials of the UK Office for National Statistics Data Science Campus, prepared by Dr. Laurie Baker.